Python for Data Science

I wrote this notebook while figuring out what tools Python offers for solving data science problems, i.e. those kinds of programming tasks where spreadsheet-like data has to be handled and analyzed with a statistical mindset. Consequently, the following code snippets mainly use packages like numpy, pandas and matplotlib. The notebook covers the topics:

  • Random Numbers
  • Numpy Arrays and Matrices
  • Broadcasting of values in matrices
  • Pandas Series and DataFrames
  • I/O for DataFrames
  • Combining, Merging and Rearranging of DataFrames
  • Histograms
  • String Manipulations
  • Plotting
  • Data Grouping and Aggregation
  • Pivot and Cross Tables
  • Handling of Time Series and Financial Data

Acknowledgment: Much of the code below is inspired by the very helpful book "Python for Data Analysis" by Wes McKinney, published by O'Reilly.

(c) Florian Hoppe 2013


In [4]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import matplotlib.pyplot as plt
import numpy.random as rand
from datetime import datetime, timedelta
import dateutil
import misc as hlp  # local helper module providing plot_categorical_data
import scipy

In [4]:
%pylab inline


Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

Random Numbers:


In [101]:
rand.randint(0,2,size=(5,2))


Out[101]:
array([[0, 0],
       [1, 0],
       [0, 1],
       [0, 1],
       [0, 0]])

In [102]:
rand.normal(0, 1, size=(10,2))


Out[102]:
array([[-0.09675095,  1.41022276],
       [ 0.08331002, -1.01390688],
       [-1.02670791,  0.78575858],
       [-0.56622282,  1.28850691],
       [-1.44063462, -0.23504318],
       [-2.60446661,  0.46358683],
       [ 0.08546987, -2.18813767],
       [ 0.59816834, -0.70568697],
       [-1.06188424,  0.34101838],
       [ 0.65117497, -0.34601664]])

In [110]:
random_data = np.concatenate([rand.normal(0,.5,(100,2)),rand.normal(5,1,(200,2))])
plt.scatter(random_data[:,0],random_data[:,1])


Out[110]:
<matplotlib.collections.PathCollection at 0xa7eb940>

In [33]:
np.random.randn(2,3) # normal distribution
np.random.rand(2,3)  # uniform distribution


Out[33]:
array([[ 0.6062381 ,  0.73490793,  0.54977326],
       [ 0.37106928,  0.74039583,  0.96761176]])

In [28]:
np.random.permutation(10)


Out[28]:
array([4, 0, 6, 2, 7, 1, 3, 5, 8, 9])

In [120]:
t = np.arange(5)
np.random.shuffle(t)
t


Out[120]:
array([0, 4, 2, 1, 3])

In [2]:
n = 10000
v1 = rand.normal(0,.5,n)
v2 = rand.normal(10,.75,n)
v3 = v1*3+10
v4 = rand.uniform(-10,10,n)
v5 = v1*v2
v6 = v1+v2+v3
mat = np.concatenate([v1,v2,v3,v4,v5,v6]).reshape(6,n).T
mat.shape


Out[2]:
(10000L, 6L)

In [59]:
v1.std()==v1.var()**.5


Out[59]:
True

In [3]:
np.cov(mat.T)


Out[3]:
array([[  2.47965623e-01,   4.08848491e-03,   7.43896868e-01,
          1.98422062e-02,   2.48179344e+00,   9.95950976e-01],
       [  4.08848491e-03,   5.61573123e-01,   1.22654547e-02,
          3.87392111e-02,   3.97894902e-02,   5.77927062e-01],
       [  7.43896868e-01,   1.22654547e-02,   2.23169060e+00,
          5.95266187e-02,   7.44538032e+00,   2.98785293e+00],
       [  1.98422062e-02,   3.87392111e-02,   5.95266187e-02,
          3.33664381e+01,   2.45949459e-01,   1.18108036e-01],
       [  2.48179344e+00,   3.97894902e-02,   7.44538032e+00,
          2.45949459e-01,   2.49790676e+01,   9.96696326e+00],
       [  9.95950976e-01,   5.77927062e-01,   2.98785293e+00,
          1.18108036e-01,   9.96696326e+00,   4.56173096e+00]])

In [4]:
np.corrcoef(mat.T)


Out[4]:
array([[ 1.        ,  0.01095629,  1.        ,  0.00689826,  0.99719888,
         0.93643447],
       [ 0.01095629,  1.        ,  0.01095629,  0.00894937,  0.01062373,
         0.36108134],
       [ 1.        ,  0.01095629,  1.        ,  0.00689826,  0.99719888,
         0.93643447],
       [ 0.00689826,  0.00894937,  0.00689826,  1.        ,  0.00851928,
         0.00957326],
       [ 0.99719888,  0.01062373,  0.99719888,  0.00851928,  1.        ,
         0.93370549],
       [ 0.93643447,  0.36108134,  0.93643447,  0.00957326,  0.93370549,
         1.        ]])

In [5]:
df = DataFrame({'v1':v1,'v2':v2,'v3':v3,'v4':v4,'v5':v5,'v6':v6})
df[:3]


Out[5]:
v1 v2 v3 v4 v5 v6
0 0.856301 9.804413 12.568904 -3.444391 8.395533 23.229618
1 0.168428 10.670377 10.505283 5.160607 1.797187 21.344088
2 -0.098325 8.586418 9.705026 7.626106 -0.844257 18.193119

In [77]:
df.cov()


Out[77]:
v1 v2 v3 v4
v1 0.249482 -0.001939 0.748447 -0.001653
v2 -0.001939 0.569149 -0.005817 -0.053476
v3 0.748447 -0.005817 2.245341 -0.004959
v4 -0.001653 -0.053476 -0.004959 33.431387

In [6]:
df.cov()
# corr is the standardized cov: each covariance is divided by the product of the standard deviations of the two variables,
# i.e. df.corr() == df.cov() / np.outer(df.std(), df.std())  (checked below)


Out[6]:
v1 v2 v3 v4 v5 v6
v1 0.247966 0.004088 0.743897 0.019842 2.481793 0.995951
v2 0.004088 0.561573 0.012265 0.038739 0.039789 0.577927
v3 0.743897 0.012265 2.231691 0.059527 7.445380 2.987853
v4 0.019842 0.038739 0.059527 33.366438 0.245949 0.118108
v5 2.481793 0.039789 7.445380 0.245949 24.979068 9.966963
v6 0.995951 0.577927 2.987853 0.118108 9.966963 4.561731
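
The relation can be checked directly: dividing the covariance matrix elementwise by the outer product of the standard deviations reproduces df.corr() (a sketch, not from the original run):

In [ ]:
df.cov() / np.outer(df.std(), df.std())  # equals df.corr()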

In [26]:
df.corr()


Out[26]:
v1 v2 v3 v4 v5 v6
v1 1.000000 0.010956 1.000000 0.006898 0.997199 0.936434
v2 0.010956 1.000000 0.010956 0.008949 0.010624 0.361081
v3 1.000000 0.010956 1.000000 0.006898 0.997199 0.936434
v4 0.006898 0.008949 0.006898 1.000000 0.008519 0.009573
v5 0.997199 0.010624 0.997199 0.008519 1.000000 0.933705
v6 0.936434 0.361081 0.936434 0.009573 0.933705 1.000000

In [8]:
U,S,V = np.linalg.svd(df)

In [16]:
Series(S)


Out[16]:
0    2.461739e+03
1    5.776395e+02
2    5.118762e+02
3    5.232475e+01
4    4.228912e+00
5    3.276289e-12
dtype: float64
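
As a sanity check, the factors reassemble the matrix (a sketch, not from the original run; np.linalg.svd returns V already transposed, and only the first len(S) columns of U are needed):

In [ ]:
np.allclose(df.values, U[:, :len(S)].dot(np.diag(S)).dot(V))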

In [14]:
V = DataFrame(V)
V[V<0.001]=0
V


Out[14]:
0 1 2 3 4 5
0 0.000000 0.0 0.000000 0.000000 0.00000 0.000000
1 0.002499 0.0 0.000000 0.999621 0.02654 0.000000
2 0.096563 0.0 0.106076 0.000000 0.97153 0.016809
3 0.000000 0.0 0.695710 0.001045 0.00000 0.000000
4 0.860556 0.0 0.000000 0.000000 0.00000 0.286849
5 0.500000 0.5 0.500000 0.000000 0.00000 0.000000

In [27]:
df.corr()


Out[27]:
v1 v2 v3 v4 v5 v6
v1 1.000000 0.010956 1.000000 0.006898 0.997199 0.936434
v2 0.010956 1.000000 0.010956 0.008949 0.010624 0.361081
v3 1.000000 0.010956 1.000000 0.006898 0.997199 0.936434
v4 0.006898 0.008949 0.006898 1.000000 0.008519 0.009573
v5 0.997199 0.010624 0.997199 0.008519 1.000000 0.933705
v6 0.936434 0.361081 0.936434 0.009573 0.933705 1.000000

In [29]:
stddf = df.sub(df.mean()).describe()  # note: .describe() reduces the demeaned data to its summary statistics
print(stddf.cov())


                 v1               v2               v3               v4  \
v1  12499774.901795  12499761.023268  12499550.957176  12498879.460149   
v2  12499761.023268  12499747.356090  12499538.057453  12498867.884048   
v3  12499550.957176  12499538.057453  12499331.626895  12498666.028934   
v4  12498879.460149  12498867.884048  12498666.028934  12498012.387348   
v5  12499221.653703  12499212.106789  12499017.962439  12498372.752814   
v6  12499501.996624  12499489.483402  12499284.438758  12498621.464544   

                 v5               v6  
v1  12499221.653703  12499501.996624  
v2  12499212.106789  12499489.483402  
v3  12499017.962439  12499284.438758  
v4  12498372.752814  12498621.464544  
v5  12498757.668284  12498976.866422  
v6  12498976.866422  12499237.969240  

In [30]:
latent, coeff = np.linalg.eig(stddf.cov())

In [35]:
DataFrame(coeff)


Out[35]:
0 1 2 3 4 5
0 0.408259 0.427506 -0.097629 -0.297606 0.717532 -0.193934
1 0.408258 0.368728 -0.099061 -0.487072 -0.670932 -0.012789
2 0.408252 0.153725 -0.178969 0.736024 -0.166302 -0.456380
3 0.408230 -0.215985 0.885130 -0.001184 0.000964 -0.056935
4 0.408242 -0.780292 -0.387778 -0.231219 0.047279 -0.135699
5 0.408250 0.046274 -0.121654 0.281065 0.071459 0.855738

For sparse matrices:


In [6]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import matplotlib.pyplot as plt
import numpy.random as rand
from datetime import datetime, timedelta
import dateutil
from scipy.stats import pareto
import scipy.stats
import scipy.sparse
import scipy.sparse.linalg
import random as rnd

In [29]:
vals = np.array([1,2,3],dtype=np.float64)
rowi = np.array([1,5,10])
coli = np.array([5,13,100])
sparseM = scipy.sparse.coo_matrix((vals,(rowi,coli)),shape=(max(rowi)+1,max(coli)+1))

In [40]:
sparseM.shape


Out[40]:
(11, 101)

In [31]:
(U,d,V) = scipy.sparse.linalg.svds(sparseM) # use parameter 'k=' to limit number of computed singular vectors

In [39]:
U.shape


Out[39]:
(11, 6)

In [35]:
d.shape


Out[35]:
(6,)

In [34]:
V.shape


Out[34]:
(6, 101)

In [ ]:
np.allclose(sparseM.toarray(), U.dot(np.diag(d)).dot(V))  # svds returns V already transposed (k x n), so no extra transpose

Arrays & Matrices:


In [195]:
# fault-tolerant value retrieval:
d = {'a':1,'b':2}
d.get('t',-1)  # returns -1 if the key is not in the dictionary


Out[195]:
-1

In [2]:
x = np.array(range(3))
print(x.dot(x))
x * x   # elementwise *


5
Out[2]:
array([0, 1, 4])

In [8]:
x = np.array([0, 1])
y = np.array([2, 3])
x * y   # elementwise * -> array([0, 3])

In [2]:
e = np.identity(3)
e


Out[2]:
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

In [3]:
m = np.arange(6).reshape(3,2)
m


Out[3]:
array([[0, 1],
       [2, 3],
       [4, 5]])

In [133]:
m[0,:]


Out[133]:
array([0, 1])

In [128]:
m[:,:1]


Out[128]:
array([[0],
       [2],
       [4]])

In [7]:
# elementwise multiplication
np.multiply(e,np.arange(9).reshape((3,3))) # == e * np.arange(9).reshape((3,3))


Out[7]:
array([[ 0.,  0.,  0.],
       [ 0.,  4.,  0.],
       [ 0.,  0.,  8.]])

In [125]:
# real matrix multiplication
np.dot(e,m)


Out[125]:
array([[ 0.,  1.],
       [ 2.,  3.],
       [ 4.,  5.]])

In [130]:
E = np.matrix(e)
M = np.matrix(m)
E*M


Out[130]:
matrix([[ 0.,  1.],
        [ 2.,  3.],
        [ 4.,  5.]])

In [132]:
M[0,:]


Out[132]:
matrix([[0, 1]])

In [131]:
M[:,0]


Out[131]:
matrix([[0],
        [2],
        [4]])

Use sparse matrices for big tables:


In [ ]:
sparseM = scipy.sparse.coo_matrix((values, (row_indices, column_indices)), shape=(max(row_indices)+1, max(column_indices)+1))

Broadcasting:


In [10]:
arr_1d = np.random.normal(size=3)
arr_1d


Out[10]:
array([-1.58033814, -0.29357562,  0.45257015])

In [11]:
arr_1d[:, np.newaxis]


Out[11]:
array([[-1.58033814],
       [-0.29357562],
       [ 0.45257015]])

In [9]:
arr_1d[np.newaxis, :]


Out[9]:
array([[-1.71826784, -0.81097857, -0.0277865 ]])
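
Combining both views shows the broadcasting rule at work: a (3,1) array plus a (1,3) array broadcasts to a full (3,3) outer sum (a sketch, not from the original run):

In [ ]:
arr_1d[:, np.newaxis] + arr_1d[np.newaxis, :]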

In [12]:
arr = np.random.randn(3, 4, 5)
arr.shape


Out[12]:
(3L, 4L, 5L)

In [17]:
depth_means = arr.mean(2)
print(depth_means.shape)
print(depth_means[:, :, np.newaxis].shape)


(3L, 4L)
(3L, 4L, 1L)

In [15]:
demean = arr - depth_means[:, :, np.newaxis]
demean.mean(2)


Out[15]:
array([[  0.00000000e+00,  -6.66133815e-17,  -2.22044605e-17,
          0.00000000e+00],
       [  1.11022302e-17,   0.00000000e+00,   0.00000000e+00,
         -6.66133815e-17],
       [  2.22044605e-17,  -4.44089210e-17,   0.00000000e+00,
          4.44089210e-17]])

In [22]:
arr = np.zeros((4, 3))
arr[:2] = [[-1.37], [0.509]]
arr


Out[22]:
array([[-1.37 , -1.37 , -1.37 ],
       [ 0.509,  0.509,  0.509],
       [ 0.   ,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ]])

In [24]:
arr = np.zeros((4, 3))
arr[:,:2] = [-1.37, 0.509]
arr


Out[24]:
array([[-1.37 ,  0.509,  0.   ],
       [-1.37 ,  0.509,  0.   ],
       [-1.37 ,  0.509,  0.   ],
       [-1.37 ,  0.509,  0.   ]])

Series & DataFrames


In [6]:
ser = Series([1,4,"three",5.])
print(ser.index)
print(ser.values)


Int64Index([0, 1, 2, 3], dtype=int64)
[1 4 'three' 5.0]

In [9]:
ser  = Series([1,4,"three",5.], index=['dim1','dim2','name','val'])
print(ser.index)


Index([dim1, dim2, name, val], dtype=object)

In [18]:
print(ser['val'])
print(ser[(ser == 5)|(ser == 1)])


5.0
dim1    1
val     5
dtype: object

In [19]:
ser2 = Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
print(ser2)


Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [21]:
ser3 = Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}, index = ['Texas','Ohio','Berlin','Oregon','Utah'])
print(ser3)


Texas     71000
Ohio      35000
Berlin      NaN
Oregon    16000
Utah       5000
dtype: float64

In [22]:
ser3.index = ['T','O','B','OR','U']
print(ser3)


T     71000
O     35000
B       NaN
OR    16000
U      5000
dtype: float64

In [80]:
ser4 = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
print(ser4)
print(ser4.reindex(['a','aa','b','c','d','e'],fill_value=99))


d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
a     -5.3
aa    99.0
b      7.2
c      3.6
d      4.5
e     99.0
dtype: float64

In [81]:
ser5 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
ser5.reindex(range(6), method='ffill')


Out[81]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [75]:
ser4


Out[75]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [45]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002],'popp': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = DataFrame(data, columns=['year', 'state', 'popp', 'debt'], index=['one', 'two', 'three', 'four', 'five'])
df


Out[45]:
year state popp debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN

In [49]:
df.ix['four'] # == df.ix[3]


Out[49]:
year       2001
state    Nevada
popp        2.4
debt        NaN
Name: four, dtype: object

In [71]:
print(df.ix['one','state'])
df.ix[['one','two'],['state','popp']]


Ohio
Out[71]:
state popp
one Ohio 1.5
two Ohio 1.7

In [48]:
df['popp'] == df.popp


Out[48]:
one      True
two      True
three    True
four     True
five     True
Name: popp, dtype: bool

In [51]:
df['newcol'] = np.arange(5)
df


Out[51]:
year state popp debt newcol
one 2000 Ohio 1.5 NaN 0
two 2001 Ohio 1.7 NaN 1
three 2002 Ohio 3.6 NaN 2
four 2001 Nevada 2.4 NaN 3
five 2002 Nevada 2.9 NaN 4

In [53]:
del df['debt']
df


Out[53]:
year state popp newcol
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4

In [85]:
df.T.drop(['year','newcol'])


Out[85]:
one two three four five
state Ohio Ohio Ohio Nevada Nevada
popp 1.5 1.7 3.6 2.4 2.9

In [83]:
df.T.drop('one',axis=1)


Out[83]:
two three four five
year 2001 2002 2001 2002
state Ohio Ohio Nevada Nevada
popp 1.7 3.6 2.4 2.9
newcol 1 2 3 4

In [56]:
df.values


Out[56]:
array([[2000, 'Ohio', 1.5, 0],
       [2001, 'Ohio', 1.7, 1],
       [2002, 'Ohio', 3.6, 2],
       [2001, 'Nevada', 2.4, 3],
       [2002, 'Nevada', 2.9, 4]], dtype=object)

In [67]:
2.4 in df.values


Out[67]:
True

In [91]:
df.T.ix['year':'popp'] # label-based slicing: the last element, i.e. 'popp', is included


Out[91]:
one two three four five
year 2000 2001 2002 2001 2002
state Ohio Ohio Ohio Nevada Nevada
popp 1.5 1.7 3.6 2.4 2.9

In [96]:
df[df.popp > 2]


Out[96]:
year state popp newcol
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4

In [100]:
df[df.popp>2].newcol.sum()


Out[100]:
9

In [5]:
dn = DataFrame(np.arange(12).reshape(4,3),columns=['col2','col1','col3'],index=['1','2','4','3'])
dn


Out[5]:
col2 col1 col3
1 0 1 2
2 3 4 5
4 6 7 8
3 9 10 11

In [6]:
func_on_series = lambda x:x.max()
dn.apply(func_on_series)


Out[6]:
col2     9
col1    10
col3    11
dtype: int64

In [7]:
dn.apply(func_on_series,axis=1)


Out[7]:
1     2
2     5
4     8
3    11
dtype: int64

In [8]:
dn.applymap(lambda x:x+1) # == dn + 1


Out[8]:
col2 col1 col3
1 1 2 3
2 4 5 6
4 7 8 9
3 10 11 12

In [9]:
dn.sort('col1',ascending=0)


Out[9]:
col2 col1 col3
3 9 10 11
4 6 7 8
2 3 4 5
1 0 1 2

In [10]:
dn.sort(['col1','col2'],ascending=[0,1])


Out[10]:
col2 col1 col3
3 9 10 11
4 6 7 8
2 3 4 5
1 0 1 2

In [121]:
dn.sort_index(ascending=False).sort_index(axis=1)


Out[121]:
col1 col2 col3
4 7 6 8
3 10 9 11
2 4 3 5
1 1 0 2

In [124]:
dn.col1.order()


Out[124]:
1     1
2     4
4     7
3    10
Name: col1, dtype: int64

In [133]:
dn.col2 = [34,34,3,99]
dn


Out[133]:
col2 col1 col3
1 34 1 2
2 34 4 5
4 3 7 8
3 99 10 11

In [134]:
dn.sort_index(by=['col2','col1'],ascending=False)


Out[134]:
col2 col1 col3
3 99 10 11
2 34 4 5
1 34 1 2
4 3 7 8

In [142]:
dn2 = dn.reindex([str(x) for x in [1,2,4,3,4]])
dn2


Out[142]:
col2 col1 col3
1 34 1 2
2 34 4 5
4 3 7 8
3 99 10 11
4 3 7 8

In [144]:
dn2.ix['4']


Out[144]:
col2 col1 col3
4 3 7 8
4 3 7 8

In [21]:
dn


Out[21]:
col2 col1 col3
1 0 1 2
2 3 4 5
4 6 7 8
3 9 10 11

In [22]:
dn.idxmax()


Out[22]:
col2    3
col1    3
col3    3
dtype: object

In [153]:
dn2.count()


Out[153]:
col2    5
col1    5
col3    5
dtype: int64

In [23]:
Series([1,2,np.NAN,2]).count()


Out[23]:
3

In [24]:
dn[dn==4] = np.nan
dn


Out[24]:
col2 col1 col3
1 0 1 2
2 3 NaN 5
4 6 7 8
3 9 10 11

In [25]:
dn.col1.isnull()


Out[25]:
1    False
2     True
4    False
3    False
Name: col1, dtype: bool

In [28]:
np.where(dn.col1.isnull(),-100,dn.col1)


Out[28]:
1      1
2   -100
4      7
3     10
Name: col1, dtype: float64

In [162]:
type(12.8)


Out[162]:
float

In [15]:
data = DataFrame([[1., 6.5, 3.], [1., np.NAN, np.NAN],[np.NAN, np.NAN, np.NAN], [np.NAN, 6.5, 3.]])
data


Out[15]:
0 1 2
0 1 6.5 3
1 1 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3

In [17]:
data.fillna(method='bfill') # or method='ffill'


Out[17]:
0 1 2
0 1 6.5 3
1 1 6.5 3
2 NaN 6.5 3
3 NaN 6.5 3

In [10]:
data.dropna(how='all')


Out[10]:
0 1 2
0 1 6.5 3
1 1 NaN NaN
3 NaN 6.5 3

In [14]:
data.fillna(0, inplace=True) # inplace=True modifies the original data
data


Out[14]:
0 1 2
0 1 6.5 3
1 1 0.0 0
2 0 0.0 0
3 0 6.5 3

In [141]:
data = DataFrame({'vals':rand.normal(0,1,size=10),'weights':rand.randint(2,size=10)})
data['results']=data.vals*data.weights
data


Out[141]:
vals weights results
0 -0.333944 0 -0.000000
1 -0.087153 0 -0.000000
2 -0.817727 0 -0.000000
3 0.525605 1 0.525605
4 -0.804634 0 -0.000000
5 2.216917 1 2.216917
6 -1.486997 1 -1.486997
7 -1.054359 0 -0.000000
8 0.768125 0 0.000000
9 -0.724449 0 -0.000000

In [142]:
data.vals.corr(data.results)


Out[142]:
0.8647573352382677

In/Out


In [ ]:
pd.read_csv('some.csv', names=['a', 'b', 'c', 'd', 'message'], header=None) # otherwise first line will define column names
pd.read_table('some.txt', sep='\t', index_col='message' ) # column 'message' will become the index of the table
pd.read_csv('some.csv', na_values = {'message':'xxx','a':'NULL'}, converters={'message':lambda x:x[0:1]}) # for each column it can be defined what entry represents a NaN and how it should be parsed by a custom function
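
A self-contained variant that needs no file on disk (a sketch using an in-memory buffer; the CSV text is made up):

In [ ]:
from io import StringIO
csv_text = u"1,2,3,4,hello\n5,6,7,8,world"
pd.read_csv(StringIO(csv_text), names=['a', 'b', 'c', 'd', 'message'], header=None)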

In [ ]:
data.to_csv(sys.stdout, index=False, header=False, sep='|') # write

In [ ]:
# JSON: read in lists of dicts to get rows where keys become columns
tweets = DataFrame(list_of_dicts, columns=list_of_dicts[0].keys())

data.save('binary.data')      # pandas-only (pickle) format
data = pd.load('binary.data')

store = pd.HDFStore('mydata.h5') # common scientific format HDF5 good for: write once, read many
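
Objects move in and out of the store with dict-like access (a sketch, assuming the store from above and some DataFrame data):

In [ ]:
store['data'] = data        # write a pandas object under a key
restored = store['data']    # ...and read it back
store.close()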

In [ ]:
from pandas.io.parsers import TextParser
from lxml.html import parse
from urllib2 import urlopen

parsed = parse(urlopen('http://finance.yahoo.com/q/op?s=AAPL+Options'))
doc = parsed.getroot()

# helper (from McKinney's book): extract the texts of all cells in one table row
def _unpack(row, kind='td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]

# create a DataFrame from a table tag in an html file:
def parse_options_data(doc):
    tables = doc.findall('.//table') # select a table
    table = tables[0]
    rows = table.findall('.//tr') # select the rows of the table
    header = _unpack(rows[0], kind='th') # extract the header row
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names=header).get_chunk() # create the DataFrame

In [ ]:
import pandas.io.sql as sql
import sqlite3
con = sqlite3.connect(':memory:') # in-memory database
sql.read_frame('select * from test', con)

Combining, Merging and Rearranging of Data


In [ ]:
#    a.combine_first(b)
#    a's values prioritized, use values from b to fill holes

# data1.update(data2, overwrite=False)
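
A minimal example of combine_first (hypothetical Series a and b, not from the original notebook):

In [ ]:
a = Series([1., np.nan, 3.], index=['x', 'y', 'z'])
b = Series([10., 20., 30.], index=['x', 'y', 'z'])
a.combine_first(b)  # a's values win; the NaN at 'y' is filled from b -> 20.0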

In [ ]:
pd.merge(df1, df2, on='key', how='left') # use a list to join on multiple keys; 'inner' is the default, 'right' and 'outer' are the other options of the how parameter
df1.join(df2, on='key') # same but in a neat way
df1.join([df2,df3]) # join three DataFrames

pd.merge(left1, right1, left_on='key', right_index=True) # joins tables by using column 'key' of left table to match with index of right table 
left1.join(right1) # same but in a neat way
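
A worked example of the first merge variant (hypothetical frames df1 and df2 sharing a 'key' column):

In [ ]:
df1 = DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
df2 = DataFrame({'key': ['a', 'b', 'd'], 'rval': [4, 5, 6]})
pd.merge(df1, df2, on='key', how='left')  # keeps all rows of df1; 'c' gets NaN for rval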

In [7]:
arr = np.arange(12).reshape((3, 4))
print(arr)
print(np.concatenate([arr,arr],axis=1))
print(np.concatenate([arr,arr]))


[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[[ 0  1  2  3  0  1  2  3]
 [ 4  5  6  7  4  5  6  7]
 [ 8  9 10 11  8  9 10 11]]
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

In [39]:
s1 = Series([0, 1], index=['a', 'b'])
s2 = Series([2, 3, 4, 5], index=['c', 'd', 'e', 'b'])

def psdf(data):
    print("/////////////////\n%s" % data)
    
psdf(pd.concat([s2, s1]))
psdf(pd.concat([s1, s2],axis=1))
psdf(pd.concat([s1, s2],axis=1,join='inner'))


/////////////////
c    2
d    3
e    4
b    5
a    0
b    1
dtype: int64
/////////////////
    0   1
a   0 NaN
b   1   5
c NaN   2
d NaN   3
e NaN   4
/////////////////
   0  1
b  1  5

In [31]:
# pivoting between row and column indices with stack/unstack:
data = DataFrame(np.arange(6).reshape((2, 3)), index=pd.Index(['Ohio', 'Colorado'], name='state'), columns=pd.Index(['one', 'two', 'three'], name='number'))
psdf(data)
psdf(data.stack())
psdf(data.stack().unstack())


/////////////////
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5
/////////////////
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64
/////////////////
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

In [56]:
data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 4]})
psdf(data)
psdf(data.drop_duplicates())


/////////////////
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4
/////////////////
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4

In [60]:
data['k3'] = range(7)
psdf(data)
psdf(data.replace([1,2],[111,222]))
psdf(data.replace({1:111,2:222})) # should do the same, but does not work in this pandas version


/////////////////
    k1  k2  k3
0  one   1   0
1  one   1   1
2  one   2   2
3  two   3   3
4  two   3   4
5  two   4   5
6  two   4   6
/////////////////
    k1   k2   k3
0  one  111    0
1  one  111  111
2  one  222  222
3  two    3    3
4  two    3    4
5  two    4    5
6  two    4    6
/////////////////
    k1  k2  k3
0  one   1   0
1  one   1   1
2  one   2   2
3  two   3   3
4  two   3   4
5  two   4   5
6  two   4   6

Histogram Logic:


In [47]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats


Out[47]:
Categorical: 
array(['(18, 25]', '(18, 25]', '(18, 25]', '(25, 35]', '(18, 25]',
       '(18, 25]', '(35, 60]', '(25, 35]', '(60, 100]', '(35, 60]',
       '(35, 60]', '(25, 35]'], dtype=object)
Levels (4): Index(['(18, 25]', '(25, 35]', '(35, 60]', '(60, 100]'], dtype=object)

In [50]:
cats.labels


Out[50]:
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1])

In [51]:
cats.levels


Out[51]:
Index([(18, 25], (25, 35], (35, 60], (60, 100]], dtype=object)

In [53]:
pd.value_counts(cats)


Out[53]:
(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64

In [54]:
cats = pd.cut(ages,4)  # create four equal-width bins
print(cats.levels)
pd.value_counts(cats)


Index([(19.959, 30.25], (30.25, 40.5], (40.5, 50.75], (50.75, 61]], dtype=object)
Out[54]:
(19.959, 30.25]    6
(30.25, 40.5]      3
(40.5, 50.75]      2
(50.75, 61]        1
dtype: int64

In [55]:
cats = pd.qcut(ages,4)  # create bins with an equal number of elements (quartiles)
print(cats.levels)
pd.value_counts(cats)


Index([[20, 22.75], (22.75, 29], (29, 38], (38, 61]], dtype=object)
Out[55]:
(29, 38]       3
[20, 22.75]    3
(38, 61]       3
(22.75, 29]    3
dtype: int64

String Manipulations


In [117]:
import re
data = Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', 'Rob': 'rob@gmail.com', 'Wes': np.nan})
data


Out[117]:
Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

In [82]:
data.str[:2]


Out[82]:
Dave      da
Rob       ro
Steve     st
Wes      NaN
dtype: object

In [83]:
data.str.join(sep=".")


Out[83]:
Dave     d.a.v.e.@.g.o.o.g.l.e...c.o.m
Rob          r.o.b.@.g.m.a.i.l...c.o.m
Steve    s.t.e.v.e.@.g.m.a.i.l...c.o.m
Wes                                NaN
dtype: object

In [77]:
# assumes a small frame with a string column c1 (hypothetical, chosen to match the output below):
strdf = DataFrame({'c1': ['abc', 'cde', 'fgh']})
strdf.c1.str.replace("c",'999')


Out[77]:
0    ab999
1    999de
2      fgh
Name: c1, dtype: object

In [119]:
matches = data.str.match('([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})', flags=re.IGNORECASE)
matches.str[1]


Out[119]:
Dave     google
Rob       gmail
Steve     gmail
Wes         NaN
dtype: object

Plotting


In [16]:
randomwalk = rand.randn(40)
fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax1.plot(randomwalk.cumsum(),'o-k')


Out[16]:
[<matplotlib.lines.Line2D at 0x108b30d90>]

In [17]:
comp1 = np.random.normal(0, 1, size=200) # N(0, 1)
comp2 = np.random.normal(10, 2, size=200) # N(10, 4)
values = Series(np.concatenate([comp1, comp2]))
values.hist(bins=100, alpha=0.3, color='k', normed=True)
values.plot(kind='kde', style='k--')


Out[17]:
<matplotlib.axes.AxesSubplot at 0x108b3d8d0>

In [18]:
# building a histogram by hand:
s = Series(np.random.randn(100000))
hist = np.histogram(s,bins=100)
plt.bar(hist[1][:-1],hist[0],width=.1)


Out[18]:
<Container object of 100 artists>

In [154]:
tmpdata = DataFrame(np.random.normal(size=(100,3)))
tmpdata['cat'] = 'A'
data = DataFrame(np.random.normal(10,2,size=(150,3)))
data['cat']='B'
tmpdata2 = DataFrame(np.random.normal(50,2,size=(150,3)))
tmpdata2['cat']='C'
data = pd.concat([tmpdata,data,tmpdata2])
del tmpdata
data[:4]


Out[154]:
0 1 2 cat
0 -0.281100 0.792052 1.172131 A
1 -0.561721 -1.543955 -0.663505 A
2 -0.864066 -0.940743 0.084267 A
3 0.490830 -1.012574 -0.456108 A

In [19]:
hlp.plot_categorical_data(data,[0,1])


Out[19]:
(<matplotlib.figure.Figure at 0x108b46a90>,
 <matplotlib.axes.AxesSubplot at 0x108e0d910>)

In [20]:
fig, ax = plt.subplots()
ax.scatter(data[data.cat=='A'][0],data[data.cat=='A'][1],marker='.',facecolor='b')
ax.scatter(data[data.cat=='B'][0],data[data.cat=='B'][1],marker='x',facecolor='r')
plt.show()



In [21]:
pd.scatter_matrix(data)


Out[21]:
array([[<matplotlib.axes.AxesSubplot object at 0x108eefe90>,
        <matplotlib.axes.AxesSubplot object at 0x109876790>,
        <matplotlib.axes.AxesSubplot object at 0x109895d50>],
       [<matplotlib.axes.AxesSubplot object at 0x109881810>,
        <matplotlib.axes.AxesSubplot object at 0x109a31390>,
        <matplotlib.axes.AxesSubplot object at 0x109a516d0>],
       [<matplotlib.axes.AxesSubplot object at 0x109a39bd0>,
        <matplotlib.axes.AxesSubplot object at 0x109a88c90>,
        <matplotlib.axes.AxesSubplot object at 0x109aa8fd0>]], dtype=object)

Data Grouping and Aggregation


In [18]:
df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a','a'],'key2' : ['one', 'two', 'one', 'two', 'one','three'],'data1' : np.random.randn(6),'data2' : np.random.randn(6)})
df


Out[18]:
data1 data2 key1 key2
0 0.739360 0.621588 a one
1 -0.106926 0.235336 a two
2 1.403078 1.752346 b one
3 -0.444629 -0.142198 b two
4 0.277675 0.737866 a one
5 0.424053 -0.541420 a three

In [35]:
df.groupby('key1').min()


Out[35]:
data1 data2 key2
key1
a -0.106926 -0.541420 one
b -0.444629 -0.142198 one

In [22]:
for name,group in df.groupby('key2'):
    print("Group %s:" % name)
    print(group)


Group one:
      data1     data2 key1 key2
0  0.739360  0.621588    a  one
2  1.403078  1.752346    b  one
4  0.277675  0.737866    a  one
Group three:
      data1    data2 key1   key2
5  0.424053 -0.54142    a  three
Group two:
      data1     data2 key1 key2
1 -0.106926  0.235336    a  two
3 -0.444629 -0.142198    b  two

In [21]:
pieces = dict(list(df.groupby('key1')))
pieces['a']


Out[21]:
data1 data2 key1 key2
0 0.739360 0.621588 a one
1 -0.106926 0.235336 a two
4 0.277675 0.737866 a one
5 0.424053 -0.541420 a three

In [37]:
df.groupby('key2')['data1'].count() # == ... .size() here, since there are no NaNs


Out[37]:
key2
one      3
three    1
two      2
dtype: int64

In [29]:
people = DataFrame(np.random.randn(5, 5),columns=['a', 'b', 'c', 'd', 'e'],index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people


Out[29]:
a b c d e
Joe 0.708303 0.340645 -0.351854 -1.377107 0.019698
Steve 1.367789 -0.205284 1.639879 0.262293 -1.069834
Wes 1.302231 1.047179 -0.317145 -2.083820 0.058323
Jim 1.095670 -0.898387 -0.656303 -0.885540 2.002418
Travis -0.025436 -0.371457 0.611380 0.578107 -0.124972

In [30]:
# Grouping columns via a dict that maps column names to group names:
people.groupby({'a':'aa','b':'bb','c':'aa','d':'aa','e':'bb'},axis=1).sum()


Out[30]:
aa bb
Joe -1.020658 0.360343
Steve 3.269961 -1.275117
Wes -1.098734 1.105502
Jim -0.446173 1.104031
Travis 1.164051 -0.496429

In [40]:
# Grouping by function:
people.groupby(lambda idx:idx[0]).first()


Out[40]:
a b c d e
J 0.708303 0.340645 -0.351854 -1.377107 0.019698
S 1.367789 -0.205284 1.639879 0.262293 -1.069834
T -0.025436 -0.371457 0.611380 0.578107 -0.124972
W 1.302231 1.047179 -0.317145 -2.083820 0.058323

Aggregation functions convert an array to a scalar.


In [34]:
# call multiple and customized aggregation functions:
df.groupby('key2').agg([('Mittel','mean'),'sum',('User function',(lambda arr: min(arr)))])


Out[34]:
data1 data2
Mittel sum User function Mittel sum User function
key2
one 0.806705 2.420114 0.277675 1.037267 3.111800 0.621588
three 0.424053 0.424053 0.424053 -0.541420 -0.541420 -0.541420
two -0.275777 -0.551555 -0.444629 0.046569 0.093138 -0.142198

In [44]:
# call different aggregation functions on different columns:
df.groupby('key2').agg({'data1':[min,max],'data2':sum})


Out[44]:
data1 data2
min max sum
key2
one 0.277675 1.403078 3.111800
three 0.424053 0.424053 -0.541420
two -0.444629 -0.106926 0.093138

Transformation functions convert an array to an array of equal size.


In [53]:
df.groupby('key1').transform(abs)


Out[53]:
data1 data2
0 0.739360 0.621588
1 0.106926 0.235336
2 1.403078 1.752346
3 0.444629 0.142198
4 0.277675 0.737866
5 0.424053 0.541420

In [54]:
def demean(arr):
    return arr - arr.mean()

df.groupby('key1').transform(demean)


Out[54]:
data1 data2
0 0.405820 0.358246
1 -0.440466 -0.028006
2 0.923854 0.947272
3 -0.923854 -0.947272
4 -0.055866 0.474524
5 0.090512 -0.804763

In [55]:
df.groupby('key1').transform(demean).mean()


Out[55]:
data1    1.850372e-17
data2    1.850372e-17
dtype: float64

The apply function can return pandas objects, which are then concatenated.


In [56]:
df.groupby('key1').apply(lambda dataframe:dataframe.describe())


Out[56]:
data1 data2
key1
a count 4.000000 4.000000
mean 0.333541 0.263343
std 0.351195 0.577910
min -0.106926 -0.541420
25% 0.181525 0.041147
50% 0.350864 0.428462
75% 0.502880 0.650658
max 0.739360 0.737866
b count 2.000000 2.000000
mean 0.479224 0.805074
std 1.306526 1.339645
min -0.444629 -0.142198
25% 0.017298 0.331438
50% 0.479224 0.805074
75% 0.941151 1.278710
max 1.403078 1.752346

In [58]:
frame = DataFrame({'data1': np.random.randn(1000),'data2': np.random.randn(1000)})
factor = pd.cut(frame.data1, 4)
factor[:10]


Out[58]:
Categorical: 
array(['(1.349, 2.816]', '(-1.583, -0.117]', '(-0.117, 1.349]',
       '(-1.583, -0.117]', '(-1.583, -0.117]', '(-1.583, -0.117]',
       '(-0.117, 1.349]', '(-0.117, 1.349]', '(-0.117, 1.349]',
       '(-1.583, -0.117]'], dtype=object)
Levels (4): Index(['(-3.0546, -1.583]', '(-1.583, -0.117]',
                   '(-0.117, 1.349]', '(1.349, 2.816]'], dtype=object)

In [67]:
grouped = frame.data2.groupby(factor)
grouped.min()


Out[67]:
data1
(-3.0546, -1.583]   -2.788622
(-1.583, -0.117]    -3.383861
(-0.117, 1.349]     -2.574235
(1.349, 2.816]      -2.831513
dtype: float64

In [80]:
def get_stats(group,some_parameter):
    return {'min': group.min() + some_parameter, 'max': group.max(),'count': group.count(), 'mean': group.mean()}

grouped.apply(get_stats, some_parameter=100).unstack()


Out[80]:
count max mean min
data1
(-3.0546, -1.583] 55 1.771513 -0.023555 97.211378
(-1.583, -0.117] 397 2.776994 0.009898 96.616139
(-0.117, 1.349] 451 2.804875 -0.046946 97.425765
(1.349, 2.816] 97 2.460343 -0.162391 97.168487

Pivot and Cross Tables


In [184]:
tips = pd.read_csv('ch08/tips.csv')
tips[:10]


Out[184]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
7 26.88 3.12 Male No Sun Dinner 4
8 15.04 1.96 Male No Sun Dinner 2
9 14.78 3.23 Male No Sun Dinner 2

In [163]:
tips.pivot_table(rows=['sex', 'smoker']) # == tips.groupby(['sex','smoker']).mean()


Out[163]:
size tip total_bill
sex smoker
Female No 2.592593 2.773519 18.105185
Yes 2.242424 2.931515 17.977879
Male No 2.711340 3.113402 19.791237
Yes 2.500000 3.051167 22.284500

In [177]:
tips.pivot_table(['tip','total_bill'],rows=['day','smoker'],cols='sex')


Out[177]:
tip total_bill
sex Female Male Female Male
day smoker
Fri No 3.125000 2.500000 19.365000 17.475000
Yes 2.682857 2.741250 12.654286 20.452500
Sat No 2.724615 3.256563 19.003846 19.929063
Yes 2.868667 2.879259 20.266667 21.837778
Sun No 3.329286 3.115349 20.824286 20.403256
Yes 3.500000 3.521333 16.540000 26.141333
Thur No 2.459600 2.941500 16.014400 18.486500
Yes 2.990000 3.058000 19.218571 19.171000

In [183]:
tips.pivot_table(['tip','total_bill'],rows=['day','smoker'],cols='sex',aggfunc=sum,margins=True)


Out[183]:
tip total_bill
sex Female Male All Female Male All
day smoker
Fri No 6.25 5.00 11.25 38.73 34.95 73.68
Yes 18.78 21.93 40.71 88.58 163.62 252.20
Sat No 35.42 104.21 139.63 247.05 637.73 884.78
Yes 43.03 77.74 120.77 304.00 589.62 893.62
Sun No 46.61 133.96 180.57 291.54 877.34 1168.88
Yes 14.00 52.82 66.82 66.16 392.12 458.28
Thur No 61.49 58.83 120.32 400.36 369.73 770.09
Yes 20.93 30.58 51.51 134.53 191.71 326.24
All 246.51 485.07 731.58 1570.95 3256.82 4827.77

In [193]:
# counting how often a certain pair of values occurs:
ser1 = Series(rand.randint(0,2,10))
ser2 = Series(rand.randint(10,12,10))
ser3 = Series(rand.randint(20,22,10))
dt = DataFrame([ser1,ser2,ser3])
print(dt)
pd.crosstab(ser1,ser2,margins=True)


    0   1   2   3   4   5   6   7   8   9
0   1   0   0   0   1   1   1   0   1   0
1  10  11  10  11  10  10  10  10  10  11
2  20  20  21  20  20  21  21  21  21  21
Out[193]:
col_0 10 11 All
row_0
0 2 3 5
1 5 0 5
All 7 3 10

In [194]:
pd.crosstab([ser1,ser2],ser3,margins=True)


Out[194]:
col_0 20 21 All
row_0 row_1
0 10 0 2 2
11 2 1 3
1 10 2 3 5
All 4 6 10

Time Series


In [42]:
now = datetime.now()
now.year, now.month, now.day, now.hour, now.minute


Out[42]:
(2013, 8, 30, 11, 45)

In [51]:
delta = now-datetime.now()
delta.days


Out[51]:
-1

In [53]:
now.strftime('%d.%m.%Y')


Out[53]:
'30.08.2013'

In [56]:
datetime.strptime("2012-12-15", '%Y-%m-%d')


Out[56]:
datetime.datetime(2012, 12, 15, 0, 0)

In [67]:
dateutil.parser.parse("07:56:59 / 12. Sep 1976", dayfirst=True)


Out[67]:
datetime.datetime(1976, 9, 12, 7, 56, 59)

In [72]:
# parse arrays of date strings with pandas:
pd.to_datetime(['7/6/2011',None])


Out[72]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-07-06 00:00:00, NaT]
Length: 2, Freq: None, Timezone: None

In [103]:
ts = Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts[:5]


Out[103]:
2000-01-01   -1.977406
2000-01-02   -1.265072
2000-01-03   -0.156686
2000-01-04    0.429408
2000-01-05   -0.104964
Freq: D, dtype: float64

In [104]:
ts['2000-02']


Out[104]:
2000-02-01    0.322620
2000-02-02    0.565578
2000-02-03    0.715117
2000-02-04   -0.666173
2000-02-05   -0.600000
2000-02-06   -0.987512
2000-02-07    2.849058
2000-02-08    0.025071
2000-02-09    0.099177
2000-02-10    1.119427
2000-02-11   -1.600575
2000-02-12    0.606665
2000-02-13   -0.344078
2000-02-14   -0.080394
2000-02-15   -0.445799
2000-02-16   -1.199052
2000-02-17   -0.408375
2000-02-18   -0.388830
2000-02-19   -0.339986
2000-02-20    1.979030
2000-02-21    0.742477
2000-02-22   -0.474137
2000-02-23   -0.000435
2000-02-24   -0.431915
2000-02-25    0.807162
2000-02-26   -0.719529
2000-02-27   -1.702265
2000-02-28   -0.412564
2000-02-29    1.988327
Freq: D, dtype: float64

In [105]:
ts['2000-01-05':'2000-01-08']


Out[105]:
2000-01-05   -0.104964
2000-01-06    0.242240
2000-01-07    0.007631
2000-01-08   -2.602129
Freq: D, dtype: float64

In [ ]:
# use ts.between_time to select between hours of a day
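# a minimal sketch (hypothetical intraday series, not from the original run):
intraday = Series(np.arange(48), index=pd.date_range('1/1/2000', periods=48, freq='H'))
intraday.between_time('09:00', '12:00')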

In [84]:
pd.date_range('1/1/2000',periods=2,freq="D")


Out[84]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01 00:00:00, 2000-01-02 00:00:00]
Length: 2, Freq: D, Timezone: None

In [91]:
pd.date_range('1/1/2000',periods=3,freq="WOM-3FRI") # the third Friday of every month, i.e. lots of options!


Out[91]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-21 00:00:00, ..., 2000-03-17 00:00:00]
Length: 3, Freq: WOM-3FRI, Timezone: None

In [90]:
ts[:'2000-01-2'].resample('H')


Out[90]:
2000-01-01 00:00:00   -0.174162
2000-01-01 01:00:00         NaN
2000-01-01 02:00:00         NaN
2000-01-01 03:00:00         NaN
2000-01-01 04:00:00         NaN
2000-01-01 05:00:00         NaN
2000-01-01 06:00:00         NaN
2000-01-01 07:00:00         NaN
2000-01-01 08:00:00         NaN
2000-01-01 09:00:00         NaN
2000-01-01 10:00:00         NaN
2000-01-01 11:00:00         NaN
2000-01-01 12:00:00         NaN
2000-01-01 13:00:00         NaN
2000-01-01 14:00:00         NaN
2000-01-01 15:00:00         NaN
2000-01-01 16:00:00         NaN
2000-01-01 17:00:00         NaN
2000-01-01 18:00:00         NaN
2000-01-01 19:00:00         NaN
2000-01-01 20:00:00         NaN
2000-01-01 21:00:00         NaN
2000-01-01 22:00:00         NaN
2000-01-01 23:00:00         NaN
2000-01-02 00:00:00   -0.024849
Freq: H, dtype: float64

In [92]:
# to compute the (percentage) change between consecutive data points use something like: ts / ts.shift(1) - 1
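# a minimal sketch (hypothetical series, not from the original run):
ts2 = Series([100., 102., 99.], index=pd.date_range('1/1/2000', periods=3))
ts2 / ts2.shift(1) - 1  # == ts2.pct_change()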

In [98]:
import pytz
ts = pd.date_range('3/9/2012 9:30', periods=3, freq='D').tz_localize('CET') # ==  pd.date_range('3/9/2012 9:30', periods=3, freq='D', tz='CET')
ts


Out[98]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-03-09 09:30:00, ..., 2012-03-11 09:30:00]
Length: 3, Freq: D, Timezone: CET

In [100]:
ts.tz_convert('US/Eastern')


Out[100]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-03-09 03:30:00, ..., 2012-03-11 04:30:00]
Length: 3, Freq: D, Timezone: US/Eastern

In [101]:
ts.tz_convert('Asia/Shanghai')


Out[101]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-03-09 16:30:00, ..., 2012-03-11 16:30:00]
Length: 3, Freq: D, Timezone: Asia/Shanghai

In [102]:
# periods are time spans:
p = pd.Period(2007, freq='A-DEC') # i.e. the span 1.1.2007 - 31.12.2007
p + 1 # i.e. 1.1.2008 - 31.12.2008


Out[102]:
Period('2008', 'A-DEC')
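
Periods can also be converted to another frequency (a sketch, not from the original run):

In [ ]:
p.asfreq('M', how='start')  # -> Period('2007-01', 'M'), the first month of the span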

Financial Data Analysis


In [22]:
import pandas.io.data as web
data = web.get_data_yahoo('GOOG', '2013-01-01')

In [23]:
data[-10:]


Out[23]:
Open High Low Close Volume Adj Close
Date
2013-10-25 1028.82 1028.82 1010.74 1015.20 2030500 1015.20
2013-10-28 1015.20 1023.43 1012.99 1015.00 1158700 1015.00
2013-10-29 1019.10 1036.94 1013.50 1036.24 1605000 1036.24
2013-10-30 1037.43 1037.51 1026.00 1030.42 1324100 1030.42
2013-10-31 1028.93 1041.52 1023.97 1030.58 1616400 1030.58
2013-11-01 1031.79 1036.00 1025.10 1027.04 1283300 1027.04
2013-11-04 1031.50 1032.37 1022.03 1026.11 1138800 1026.11
2013-11-05 1020.35 1031.65 1017.42 1021.52 1181400 1021.52
2013-11-06 1025.60 1027.00 1015.37 1022.75 912900 1022.75
2013-11-07 1022.61 1023.93 1007.64 1007.95 1679600 1007.95
